Impact of Molecular Descriptors on Solubility Prediction in Chemical Compounds¶

Solubility is a fundamental property in drug design and environmental science, influencing a compound's absorption, distribution, metabolism, and excretion (Lipinski et al., 2001). The dataset includes numerous molecular descriptors such as molecular weight, LogP, number of hydrogen bond acceptors and donors, topological polar surface area (TPSA), and more.

Understanding the relationship between these descriptors and solubility can enhance the accuracy of predictive models, leading to better drug formulations and environmental risk assessments. Previous studies have demonstrated that descriptors like LogP and TPSA significantly influence solubility due to their roles in intermolecular interactions and molecular surface properties (Hou, Xu & Lee, 2009).

In this analysis, we will apply machine learning algorithms to determine the predictive power of these descriptors, identify key predictors, and evaluate the model's performance using metrics such as R-squared and mean absolute error (MAE). This approach not only aids in theoretical understanding but also has practical implications for chemical compound design and environmental safety assessments.
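The workflow described above can be sketched end to end. The following is a minimal sketch on synthetic stand-in data (not the resit dataset), showing a regressor fitted and scored with R-squared and MAE; the feature matrix and coefficients are invented for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error

# Synthetic stand-ins for descriptors (X) and solubility (y)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))  # columns play the role of MolWt, LogP, TPSA, ...
y = X @ np.array([0.5, -1.0, 0.3, 0.0, 0.2]) + rng.normal(scale=0.1, size=500)

# Hold out 20% of rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)            # proportion of variance explained
mae = mean_absolute_error(y_test, pred)  # average absolute prediction error
```

The same pattern, with the real descriptor columns as `X` and `Solubility` as `y`, is what the later modeling sections build on.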

Data Cleaning & Preprocessing¶

Data cleaning and preprocessing are critical steps in data analysis and machine learning, ensuring the accuracy and quality of the dataset used for modeling (Zhu et al., 2019). The dataset under consideration comprises various chemical compounds and their properties, such as molecular weight, solubility, and topological polar surface area, among others. Effective preprocessing techniques transform raw data into a suitable format, addressing issues such as missing values, inconsistencies, and irrelevant information, thereby enhancing the dataset's usability for subsequent analysis (Kotsiantis, Kanellopoulos & Pintelas, 2006). The data dictionary for the dataset's 26 columns is given below:

  1. ID: Unique identifier for each compound.
  2. Name: Chemical name of the compound.
  3. InChI: IUPAC International Chemical Identifier.
  4. InChIKey: Hashed, fixed-length representation of the InChI.
  5. SMILES: Simplified Molecular Input Line Entry System notation.
  6. Solubility: Solubility measure of the compound.
  7. SD: Standard deviation of the solubility measure.
  8. Ocurrences: Number of occurrences in the dataset.
  9. Group: Group classification.
  10. MolWt: Molecular weight of the compound.
  11. MolLogP: Logarithm of the octanol-water partition coefficient.
  12. MolMR: Molecular refractivity.
  13. HeavyAtomCount: Count of heavy (non-hydrogen) atoms.
  14. NumHAcceptors: Number of hydrogen bond acceptors.
  15. NumHDonors: Number of hydrogen bond donors.
  16. NumHeteroatoms: Number of heteroatoms.
  17. NumRotatableBonds: Number of rotatable bonds.
  18. NumValenceElectrons: Number of valence electrons.
  19. NumAromaticRings: Number of aromatic rings.
  20. NumSaturatedRings: Number of saturated rings.
  21. NumAliphaticRings: Number of aliphatic rings.
  22. RingCount: Total ring count.
  23. TPSA: Topological Polar Surface Area.
  24. LabuteASA: Labute's Approximate Surface Area.
  25. BalabanJ: Balaban's J connectivity index.
  26. BertzCT: Bertz complexity index.
In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
In [12]:
df = pd.read_csv('MS4S16_Resit_Dataset.csv')
df
Out[12]:
ID Name InChI InChIKey SMILES Solubility SD Ocurrences Group MolWt ... NumRotatableBonds NumValenceElectrons NumAromaticRings NumSaturatedRings NumAliphaticRings RingCount TPSA LabuteASA BalabanJ BertzCT
0 A-3 N,N,N-trimethyloctadecan-1-aminium bromide InChI=1S/C21H46N.BrH/c1-5-6-7-8-9-10-11-12-13-... SZEMGTQCPRNXEG-UHFFFAOYSA-M [Br-].CCCCCCCCCCCCCCCCCC[N+](C)(C)C -3.616127 0.000000 1 G1 392.510 ... 17.0 142.0 0.0 0.0 0.0 0.0 0.00 158.520601 0.000000e+00 210.377334
1 A-4 Benzo[cd]indol-2(1H)-one InChI=1S/C11H7NO/c13-11-8-5-1-3-7-4-2-6-9(12-1... GPYLCFQEKPUWLD-UHFFFAOYSA-N O=C1Nc2cccc3cccc1c23 -3.254767 0.000000 1 G1 169.183 ... 0.0 62.0 2.0 0.0 1.0 3.0 29.10 75.183563 2.582996e+00 511.229248
2 A-5 4-chlorobenzaldehyde InChI=1S/C7H5ClO/c8-7-3-1-6(5-9)2-4-7/h1-5H AVPYQKSLYISFPO-UHFFFAOYSA-N Clc1ccc(C=O)cc1 -2.177078 0.000000 1 G1 140.569 ... 1.0 46.0 1.0 0.0 0.0 1.0 17.07 58.261134 3.009782e+00 202.661065
3 A-8 zinc bis[2-hydroxy-3,5-bis(1-phenylethyl)benzo... InChI=1S/2C23H22O3.Zn/c2*1-15(17-9-5-3-6-10-17... XTUPUYCJWKHGSW-UHFFFAOYSA-L [Zn++].CC(c1ccccc1)c2cc(C(C)c3ccccc3)c(O)c(c2)... -3.924409 0.000000 1 G1 756.226 ... 10.0 264.0 6.0 0.0 0.0 6.0 120.72 323.755434 2.322963e-07 1964.648666
4 A-9 4-({4-[bis(oxiran-2-ylmethyl)amino]phenyl}meth... InChI=1S/C25H30N2O4/c1-5-20(26(10-22-14-28-22)... FAUAZXVRLVIARB-UHFFFAOYSA-N C1OC1CN(CC2CO2)c3ccc(Cc4ccc(cc4)N(CC5CO5)CC6CO... -4.662065 0.000000 1 G1 422.525 ... 12.0 164.0 2.0 4.0 4.0 6.0 56.60 183.183268 1.084427e+00 769.899934
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9977 I-84 tetracaine InChI=1S/C15H24N2O2/c1-4-5-10-16-14-8-6-13(7-9... GKCBAIGFKIBETG-UHFFFAOYSA-N C(c1ccc(cc1)NCCCC)(=O)OCCN(C)C -3.010000 0.000000 1 G1 264.369 ... 8.0 106.0 1.0 0.0 0.0 1.0 41.57 115.300645 2.394548e+00 374.236893
9978 I-85 tetracycline InChI=1S/C22H24N2O8/c1-21(31)8-5-4-6-11(25)12(... OFVLGDICTFRJMM-WESIUVDSSA-N OC1=C(C(C2=C(O)[C@@](C(C(C(N)=O)=C(O)[C@H]3N(C... -2.930000 0.000000 1 G1 444.440 ... 2.0 170.0 1.0 0.0 3.0 4.0 181.62 182.429237 2.047922e+00 1148.584975
9979 I-86 thymol InChI=1S/C10H14O/c1-7(2)9-5-4-8(3)6-10(9)11/h4... MGSRCZKZVOBKFT-UHFFFAOYSA-N c1(cc(ccc1C(C)C)C)O -2.190000 0.019222 3 G5 150.221 ... 1.0 60.0 1.0 0.0 0.0 1.0 20.23 67.685405 3.092720e+00 251.049732
9980 I-93 verapamil InChI=1S/C27H38N2O4/c1-20(2)27(19-28,22-10-12-... SGTNSNPWRIOYBX-UHFFFAOYSA-N COc1ccc(CCN(C)CCCC(C#N)(C(C)C)c2ccc(OC)c(OC)c2... -3.980000 0.000000 1 G1 454.611 ... 13.0 180.0 2.0 0.0 0.0 2.0 63.95 198.569223 2.023333e+00 938.203977
9981 I-94 warfarin InChI=1S/C19H16O4/c1-12(20)11-15(13-7-3-2-4-8-... PJVWKTKQMONHTI-UHFFFAOYSA-N CC(=O)CC(c1ccccc1)c1c(O)c2ccccc2oc1=O -4.780000 0.450506 3 G5 308.333 ... 4.0 116.0 3.0 0.0 0.0 3.0 67.51 132.552025 2.258072e+00 909.550973

9982 rows × 26 columns

In [13]:
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
numerical_cols
Out[13]:
['Solubility',
 'SD',
 'Ocurrences',
 'MolWt',
 'MolLogP',
 'MolMR',
 'HeavyAtomCount',
 'NumHAcceptors',
 'NumHDonors',
 'NumHeteroatoms',
 'NumRotatableBonds',
 'NumValenceElectrons',
 'NumAromaticRings',
 'NumSaturatedRings',
 'NumAliphaticRings',
 'RingCount',
 'TPSA',
 'LabuteASA',
 'BalabanJ',
 'BertzCT']

As part of data preprocessing, we identified 20 of the 26 columns in the dataset as numerical: 'Solubility', 'SD', 'Ocurrences', 'MolWt', 'MolLogP', 'MolMR', 'HeavyAtomCount', 'NumHAcceptors', 'NumHDonors', 'NumHeteroatoms', 'NumRotatableBonds', 'NumValenceElectrons', 'NumAromaticRings', 'NumSaturatedRings', 'NumAliphaticRings', 'RingCount', 'TPSA', 'LabuteASA', 'BalabanJ', and 'BertzCT'.

In [14]:
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
categorical_cols
Out[14]:
['ID', 'Name', 'InChI', 'InChIKey', 'SMILES', 'Group']

Following the 20 numerical columns identified above, we then identified the remaining 6 categorical (non-numeric) columns: 'ID', 'Name', 'InChI', 'InChIKey', 'SMILES', and 'Group'.

In [15]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9982 entries, 0 to 9981
Data columns (total 26 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   9982 non-null   object 
 1   Name                 9982 non-null   object 
 2   InChI                9982 non-null   object 
 3   InChIKey             9982 non-null   object 
 4   SMILES               9982 non-null   object 
 5   Solubility           9982 non-null   float64
 6   SD                   9982 non-null   float64
 7   Ocurrences           9982 non-null   int64  
 8   Group                9982 non-null   object 
 9   MolWt                9982 non-null   float64
 10  MolLogP              9982 non-null   float64
 11  MolMR                9982 non-null   float64
 12  HeavyAtomCount       9982 non-null   float64
 13  NumHAcceptors        9982 non-null   float64
 14  NumHDonors           9982 non-null   float64
 15  NumHeteroatoms       9982 non-null   float64
 16  NumRotatableBonds    9982 non-null   float64
 17  NumValenceElectrons  9982 non-null   float64
 18  NumAromaticRings     9982 non-null   float64
 19  NumSaturatedRings    9982 non-null   float64
 20  NumAliphaticRings    9982 non-null   float64
 21  RingCount            9982 non-null   float64
 22  TPSA                 9982 non-null   float64
 23  LabuteASA            9982 non-null   float64
 24  BalabanJ             9982 non-null   float64
 25  BertzCT              9982 non-null   float64
dtypes: float64(19), int64(1), object(6)
memory usage: 2.0+ MB

Looking through the dataset, every column has 9,982 non-null entries, so no missing-value handling is required. The column datatypes are also appropriate (floats and integers for the descriptors, objects for the identifiers), and we are ready to proceed to the rest of the analysis.
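The absence of nulls (and of duplicated rows) can also be checked explicitly. The following is a minimal sketch on a toy frame; the column names and values are illustrative stand-ins for the real CSV:

```python
import pandas as pd

# Toy frame standing in for the compound dataset
toy = pd.DataFrame({
    'ID': ['A-1', 'A-2', 'A-3'],
    'Solubility': [-3.6, -2.1, -4.7],
    'MolWt': [392.5, 140.6, 422.5],
})

null_counts = toy.isnull().sum()       # per-column null tally
n_duplicates = toy.duplicated().sum()  # count of fully duplicated rows
```

On the real data, `df.isnull().sum()` and `df.duplicated().sum()` give the same checks without relying on reading `df.info()` output by eye.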

Descriptive Statistics & Univariate Analysis¶

Descriptive statistics provide a summary of the central tendency, dispersion, and shape of a dataset’s distribution. Common measures include mean, median, mode, variance, standard deviation, and range (Field, 2013). These statistics help in understanding the basic features of the data, offering insights into its structure and variability.

On the other hand, univariate analysis focuses on examining the distribution of a single variable. It includes visualizations such as histograms, box plots, and bar charts, which highlight patterns, outliers, and the overall spread of the data (Weinberg & Abramowitz, 2008). Applying descriptive statistics and univariate analysis to the dataset helps identify key characteristics and initial patterns in the chemical properties. Below are the descriptive statistics and univariate analysis for this dataset.

In [16]:
df.describe()
Out[16]:
Solubility SD Ocurrences MolWt MolLogP MolMR HeavyAtomCount NumHAcceptors NumHDonors NumHeteroatoms NumRotatableBonds NumValenceElectrons NumAromaticRings NumSaturatedRings NumAliphaticRings RingCount TPSA LabuteASA BalabanJ BertzCT
count 9982.000000 9982.000000 9982.000000 9982.000000 9982.000000 9982.000000 9982.000000 9982.000000 9982.000000 9982.000000 9982.000000 9982.000000 9982.000000 9982.000000 9982.000000 9982.000000 9982.000000 9982.000000 9982.000000 9982.000000
mean -2.889909 0.067449 1.378081 266.665946 1.979167 66.794594 17.374674 3.486776 1.108595 5.196955 4.073031 94.243438 1.068323 0.292627 0.447606 1.515929 62.458601 108.912586 2.392199 467.336782
std 2.368154 0.234702 1.023476 184.179024 3.517738 46.523021 12.241536 3.498203 1.488973 4.736275 5.646925 64.748563 1.309427 0.879599 1.054667 1.644334 63.348307 76.462726 1.091123 546.631696
min -13.171900 0.000000 1.000000 9.012000 -40.873200 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 7.504228 -0.000004 0.000000
25% -4.326325 0.000000 1.000000 162.276000 0.619750 40.602475 11.000000 2.000000 0.000000 3.000000 1.000000 58.000000 0.000000 0.000000 0.000000 0.000000 26.300000 66.623721 2.004357 163.243343
50% -2.618173 0.000000 1.000000 228.682000 1.949220 58.633400 15.000000 3.000000 1.000000 4.000000 3.000000 82.000000 1.000000 0.000000 0.000000 1.000000 50.720000 93.299873 2.539539 351.640433
75% -1.209735 0.000000 1.000000 320.436000 3.419030 81.938000 21.000000 4.000000 2.000000 6.000000 5.000000 112.000000 2.000000 0.000000 1.000000 2.000000 80.390000 129.118374 3.032456 606.562848
max 2.137682 3.870145 38.000000 5299.456000 68.541140 1419.351700 388.000000 86.000000 26.000000 89.000000 141.000000 2012.000000 35.000000 30.000000 30.000000 36.000000 1214.340000 2230.685124 7.517310 20720.267708

The dataset, comprising 9,982 observations, provides a comprehensive overview of various physical and chemical properties of compounds. Each feature presents unique insights, reflecting the diversity within the dataset.

The solubility values exhibit a wide range, with a mean of -2.89 and a standard deviation of 2.37. The values span from -13.17 to 2.14, with a median of -2.62, indicating diverse solubility properties among the compounds. The standard deviation (SD) of the solubility measure ranges from 0.00 to 3.87, with a mean of 0.07 and a standard deviation of 0.23, suggesting that most compounds have relatively consistent measurements.

The occurrence count has a mean of 1.38 with a standard deviation of 1.02. The minimum (lowest value) is 1 and the maximum (highest value) is 38, indicating that most compounds appear only once. Molecular weight (MolWt) varies significantly, with a mean of 266.67 and a standard deviation of 184.18, ranging from 9.01 to 5,299.46. The median is 228.68, highlighting the presence of both small and very large molecules.

MolLogP values, reflecting the lipophilicity of the compounds, have a mean of 1.98 and a standard deviation of 3.52. The values range from -40.87 to 68.54, with a median of 1.95, indicating a broad range of lipophilicity. Molecular refractivity (MolMR) shows considerable variability, with a mean of 66.79 and a standard deviation of 46.52. The values range from 0.00 to 1,419.35, with a median of 58.63, indicating different levels of interaction with solvents.

The number of heavy atoms ranges from 1 to 388, with a mean of 17.37 and a standard deviation of 12.24. The median value is 15, reflecting the structural complexity of the molecules. The mean number of hydrogen bond acceptors (NumHAcceptors) is 3.49, with a standard deviation of 3.50, ranging from 0 to 86, with a median of 3, indicating varying chemical properties.

Hydrogen bond donors (NumHDonors) are fewer in number, with a mean of 1.11 and a standard deviation of 1.49. The values range from 0 to 26, with a median of 1, showing a moderate spread. The number of heteroatoms (NumHeteroatoms) varies widely, with a mean of 5.20 and a standard deviation of 4.74. The values range from 0 to 89, with a median of 4, affecting the reactivity and properties of the compounds.

The number of rotatable bonds (NumRotatableBonds), indicating molecular flexibility, has a mean of 4.07 and a standard deviation of 5.65. The values range from 0 to 141, with a median of 3. Valence electrons (NumValenceElectrons) vary greatly, with a mean of 94.24 and a standard deviation of 64.75. The values range from 0 to 2,012, with a median of 82, influencing the chemical reactivity of the compounds.

The number of aromatic rings (NumAromaticRings) ranges from 0 to 35, with a mean of 1.07 and a standard deviation of 1.31. The median value is 1, affecting the stability and electronic properties of the compounds. Saturated rings (NumSaturatedRings) are less common, with a mean of 0.29 and a standard deviation of 0.88. The values range from 0 to 30, with a median of 0.

Aliphatic rings (NumAliphaticRings) are also less common, with a mean of 0.45 and a standard deviation of 1.05. The values range from 0 to 30, with a median of 0. The total ring count (RingCount) varies, with a mean of 1.52 and a standard deviation of 1.64. The values range from 0 to 36, with a median of 1, reflecting different levels of molecular complexity.

Topological Polar Surface Area (TPSA) values have a mean of 62.46 and a standard deviation of 63.35. The values range from 0.00 to 1,214.34, with a median of 50.72, influencing the permeability and solubility of the compounds. LabuteASA values indicate significant variability, with a mean of 108.91 and a standard deviation of 76.46. The values range from 7.50 to 2,230.69, with a median of 93.30, indicating the approximate surface area.

The Balaban index (BalabanJ) values suggest diverse levels of molecular connectivity, with a mean of 2.39 and a standard deviation of 1.09. The values range from -0.00 to 7.52, with a median of 2.54. The complexity of the molecules, as indicated by BertzCT, varies widely: the mean is 467.34 with a standard deviation of 546.63, and the values range from 0.00 to 20,720.27, with a median of 351.64.

In [17]:
df[numerical_cols].hist(figsize=(12, 10))
plt.show()
In [18]:
def remove_outliers(df, col):
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

def annotate_boxplot(ax, data, col):
    min_val = data.min()
    q1 = data.quantile(0.25)
    median = data.median()
    q3 = data.quantile(0.75)
    max_val = data.max()
    
    textstr = f'Min: {min_val:.2f}\nQ1: {q1:.2f}\nMedian: {median:.2f}\nQ3: {q3:.2f}\nMax: {max_val:.2f}'
    props = dict(boxstyle='round,pad=0.3', edgecolor='black', facecolor='none')
    ax.text(0.05, -0.1, textstr, transform=ax.transAxes, fontsize=10,
            verticalalignment='top', bbox=props)

numerical_cols = df.select_dtypes(include='number').columns

for col in numerical_cols:
    plt.figure(figsize=(20, 5))
    ax = sns.boxplot(x=df[col])
    annotate_boxplot(ax, df[col], col)
    plt.title(f'Boxplot of {col}')
    plt.show()
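Note that `remove_outliers` is defined in the cell above but not yet applied. The following is a minimal sketch of how it behaves on a toy frame (the values are invented for illustration): it keeps only rows within 1.5 IQR of the quartiles for the chosen column.

```python
import pandas as pd

def remove_outliers(df, col):
    # Standard 1.5 * IQR rule, as defined in the notebook
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

# Toy column with one obvious outlier (5000)
toy = pd.DataFrame({'MolWt': [100, 110, 105, 120, 115, 5000]})
trimmed = remove_outliers(toy, 'MolWt')  # the 5000 row is dropped
```

Whether to actually drop such rows from the compound data is a modeling decision; extreme descriptor values here can be genuine large molecules rather than errors.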

Next, we explored the dataset visually. The EDA conducted on the dataset reveals significant insights into the distribution and variability of the physical and chemical properties of the compounds. The histograms provide a visual representation of the distribution of each feature, highlighting the diversity within the dataset.

For solubility, the values are mostly concentrated between roughly -4 and -1, with a long tail towards strongly negative values. The standard deviation of the solubility measure is predominantly zero, indicating that most compounds have consistent measurements. Most compounds occur only once, with a few appearing multiple times. Molecular weight exhibits a right-skewed distribution, indicating the presence of several heavier compounds. The log of the partition coefficient (MolLogP) shows a roughly symmetric distribution centred near 2, with a wide spread. Molecular refractivity (MolMR) has a broad, right-skewed distribution, indicating diverse interaction levels with solvents.

The number of heavy atoms peaks around 15 atoms, with a long right tail. The number of hydrogen bond acceptors varies widely, with a peak at around three acceptors. Most compounds have zero or one hydrogen bond donor, with fewer having more. The number of heteroatoms peaks at around four, indicating a diverse set of compounds. Rotatable bonds are mostly between zero and five, showing varying molecular flexibility. Valence electrons exhibit a broad distribution, peaking around 80 electrons.

Most compounds have zero to two aromatic rings. Saturated rings are less common, with most compounds having zero. Similarly, most compounds have zero aliphatic rings. The total ring count is mostly between zero and two, reflecting different levels of molecular complexity. Topological polar surface area (TPSA) varies widely, affecting compound permeability and solubility. Approximate surface area values (LabuteASA) are broadly distributed, indicating significant variability. The molecular connectivity index (BalabanJ) shows a roughly bell-shaped distribution, peaking around 2.5. Molecular complexity (BertzCT) varies widely, with a right-skewed distribution.
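The skew observations above can be quantified with `pandas.DataFrame.skew`. The following is a minimal sketch on synthetic stand-in columns (a lognormal column mimicking a right-skewed descriptor like MolWt, a normal column mimicking a roughly symmetric one like BalabanJ; the names are illustrative, not the real data):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'MolWt_like': rng.lognormal(mean=5, sigma=0.6, size=2000),    # right-skewed
    'BalabanJ_like': rng.normal(loc=2.5, scale=0.9, size=2000),   # roughly symmetric
})
skews = toy.skew()  # clearly positive => right-skewed; near 0 => symmetric
```

On the real frame, `df[numerical_cols].skew()` turns the visual impression from the histograms into a single comparable number per column.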

Box Plots¶

The box plots provide a detailed summary of the data's distribution, highlighting the median, quartiles, and potential outliers. The median solubility is -2.62, with an interquartile range (IQR) from -4.33 to -1.21. Several outliers are observed on the negative side. The median standard deviation is zero, with most values clustering around zero and a few outliers. Most compounds occur once, with high outliers reaching 38 occurrences.

The median molecular weight is 228.68, with an IQR from 162.28 to 320.44. Several high outliers are present. The median MolLogP is 1.95, with an IQR from 0.62 to 3.42. Both high and low outliers are observed. The median molecular refractivity is 58.63, with an IQR from 40.60 to 81.94, and high outliers are noted. The median number of heavy atoms is 15, with an IQR from 11 to 21, and a few high outliers. The median number of hydrogen bond acceptors is three, with an IQR from two to four, and some high outliers. The median number of hydrogen bond donors is one, with an IQR from zero to two, and outliers extending to 26 donors. The median number of heteroatoms is four, with an IQR from three to six, and some high outliers.

The median number of rotatable bonds is three, with an IQR from one to five, and high outliers up to 141 bonds. The median number of valence electrons is 82, with an IQR from 58 to 112, and some high outliers. The median number of aromatic rings is one, with an IQR from zero to two, and outliers extending up to 35 rings. Most values for saturated and aliphatic rings are zero, with outliers reaching 30 rings. The median ring count is one, with an IQR from zero to two, and high outliers. The median TPSA is 50.72, with an IQR from 26.30 to 80.39, and outliers extending to high values. The median LabuteASA is 93.30, with an IQR from 66.62 to 129.12, and high outliers. The median BalabanJ is 2.54, with an IQR from 2.00 to 3.03, and outliers extending to higher values. The median BertzCT is 351.64, with an IQR from 163.24 to 606.56, and high outliers.

In [8]:
plt.figure(figsize = (20,8))
sns.histplot(df['NumAromaticRings'], kde=True)
plt.title('NumAromaticRings Distribution')
plt.ylabel("Frequency")
plt.show()
In [10]:
plt.figure(figsize = (20,8))
sns.histplot(df['SD'], kde=True)
plt.title('SD Distribution')
plt.ylabel("Frequency")
plt.show()
In [11]:
plt.figure(figsize = (20,8))
sns.histplot(df['Ocurrences'], kde=True)
plt.title('Ocurrences Distribution')
plt.ylabel("Frequency")
plt.show()
In [12]:
df.query('Ocurrences > 6')
Out[12]:
ID Name InChI InChIKey SMILES Solubility SD Ocurrences Group MolWt ... NumRotatableBonds NumValenceElectrons NumAromaticRings NumSaturatedRings NumAliphaticRings RingCount TPSA LabuteASA BalabanJ BertzCT
30 A-45 hydroxylamine InChI=1S/H3NO/c1-2/h2H,1H2 AVXURJPOCDRRFD-UHFFFAOYSA-N NO -0.763034 0.861298 7 G4 33.030 ... 0.0 14.0 0.0 0.0 0.0 0.0 46.25 12.462472 1.000000e+00 2.000000
87 A-142 1-methyl-4-(prop-1-en-2-yl)cyclohex-1-ene InChI=1S/C10H16/c1-8(2)10-6-4-9(3)5-7-10/h4,10... XMGQYMWWDOXHJM-UHFFFAOYSA-N CC(=C)C1CCC(=CC1)C -3.846497 0.730589 9 G4 136.238 ... 1.0 56.0 0.0 0.0 1.0 1.0 0.00 63.638693 2.495412e+00 162.877124
151 A-254 calcium dihydroxide InChI=1S/Ca.2H2O/h;2*1H2/q+2;;/p-2 AXCZMVOFGPJBDE-UHFFFAOYSA-L [OH-].[OH-].[Ca++] -1.910730 0.672750 7 G4 74.092 ... 0.0 16.0 0.0 0.0 0.0 0.0 60.00 48.445753 -0.000000e+00 2.754888
226 A-382 triethylamine InChI=1S/C6H15N/c1-4-7(5-2)6-3/h4-6H2,1-3H3 ZMANZCXQSJIPKH-UHFFFAOYSA-N CCN(CC)CC -0.137683 0.335493 7 G5 101.193 ... 3.0 44.0 0.0 0.0 0.0 0.0 3.24 46.323795 2.992303e+00 25.651484
238 A-400 Hydrocarbons, C5-rich InChI=1S/C5H12/c1-3-5-4-2/h3-5H2,1-2H3 OFBQJSOFQDEBGM-UHFFFAOYSA-N CCCCC -3.006984 0.421426 8 G5 72.151 ... 2.0 32.0 0.0 0.0 0.0 0.0 0.00 34.199019 2.190610e+00 7.509775
292 A-491 benzene-1,2-dicarboxylic acid InChI=1S/C8H6O4/c9-7(10)5-3-1-2-4-6(5)8(11)12/... XNGIFLGASWRNHJ-UHFFFAOYSA-N OC(=O)c1ccccc1C(O)=O -1.424573 0.329865 8 G5 166.132 ... 2.0 62.0 1.0 0.0 0.0 1.0 74.60 68.072799 3.267076e+00 296.584675
459 A-808 5-methyl-2-(propan-2-yl)cyclohexan-1-ol InChI=1S/C10H20O/c1-7(2)9-5-4-8(3)6-10(9)11/h7... NOOLISFMXDJSKH-KXUCPTDWSA-N CC(C)[C@@H]1CC[C@@H](C)C[C@H]1O -2.570624 0.045118 7 G5 156.269 ... 1.0 66.0 0.0 1.0 1.0 1.0 20.23 69.812133 2.437159e+00 120.041185
534 A-954 1,2,4-trichlorobenzene InChI=1S/C6H3Cl3/c7-4-1-2-5(8)6(9)3-4/h1-3H PBKONEOXTCPAFI-UHFFFAOYSA-N Clc1ccc(Cl)c(Cl)c1 -3.702452 0.124042 8 G5 181.449 ... 0.0 48.0 1.0 0.0 0.0 1.0 0.00 68.341202 3.171678e+00 219.543000
665 A-1156 ethenylbenzene InChI=1S/C8H8/c1-2-8-6-4-3-5-7-8/h2-7H,1H2 PPBRXRYQALVLMV-UHFFFAOYSA-N C=Cc1ccccc1 -2.555270 0.104808 7 G5 104.152 ... 1.0 40.0 1.0 0.0 0.0 1.0 0.00 49.471684 2.991047e+00 162.653001
751 A-1308 butylbenzene InChI=1S/C10H14/c1-3-9(2)10-7-5-4-6-8-10/h4-9H... ZJMWRROPUADPEA-UHFFFAOYSA-N CCC(C)c1ccccc1 -3.758608 0.264324 7 G5 134.222 ... 2.0 54.0 1.0 0.0 0.0 1.0 0.00 62.891172 2.746978e+00 176.489615
814 A-1414 oxostibanyl stibinate InChI=1S/3O.2Sb/q3*-2;2*+3 GHPGOEFPKIHBNM-UHFFFAOYSA-N [O--].[O--].[O--].[Sb+3].[Sb+3] -5.023755 0.653110 10 G4 291.517 ... 0.0 28.0 0.0 0.0 0.0 0.0 85.50 64.951844 -0.000000e+00 4.854753
835 A-1450 calcium bis(12-hydroxyoctadecanoate) InChI=1S/2C18H36O3.Ca/c2*1-2-3-4-11-14-17(19)1... RXPKHKBYUIHIGL-UHFFFAOYSA-L [Ca++].CCCCCCC(O)CCCCCCCCCCC([O-])=O.CCCCCCC(O... -5.463097 1.000831 13 G4 639.028 ... 32.0 252.0 0.0 0.0 0.0 0.0 120.72 296.133803 -7.272727e-07 511.858709
918 A-1579 2,3-dihydro-1,2-benzothiazol-3-one InChI=1S/C7H5NOS/c9-7-5-3-1-2-4-6(5)10-8-7/h1-... DMSMPAJRVJJAGA-UHFFFAOYSA-N O=C1NSc2ccccc12 -1.947290 0.235013 12 G5 151.190 ... 0.0 50.0 2.0 0.0 0.0 2.0 32.86 61.275691 3.070615e+00 400.898353
967 A-1679 1,3-diphenylguanidine InChI=1S/C13H13N3/c14-13(15-11-7-3-1-4-8-11)16... OWRCNXZUPFZXOS-UHFFFAOYSA-N NC(Nc1ccccc1)=Nc2ccccc2 -2.157516 1.067788 8 G4 211.268 ... 2.0 80.0 2.0 0.0 0.0 2.0 50.41 94.638515 2.090928e+00 463.051084
1180 A-2032 2,6-dibromo-4-[2-(3,5-dibromo-4-hydroxyphenyl)... InChI=1S/C15H12Br4O2/c1-15(2,7-3-9(16)13(20)10... VEORPZCZECFIRK-UHFFFAOYSA-N CC(C)(c1cc(Br)c(O)c(Br)c1)c2cc(Br)c(O)c(Br)c2 -5.694106 0.635897 7 G4 543.875 ... 2.0 112.0 2.0 0.0 0.0 2.0 40.46 156.641982 2.614907e+00 605.305185
1241 A-2147 tetradecan-1-ol InChI=1S/C14H30O/c1-2-3-4-5-6-7-8-9-10-11-12-1... HLZKNKRTKFSKGZ-UHFFFAOYSA-N CCCCCCCCCCCCCCO -5.787143 0.301618 7 G5 214.393 ... 12.0 92.0 0.0 0.0 0.0 0.0 20.23 96.277732 2.807926e+00 89.511823
1300 A-2257 1,2-bis(2-ethylhexyl) benzene-1,2-dicarboxylate InChI=1S/C24H38O4/c1-5-9-13-19(7-3)17-27-23(25... BJQHLKABXJIVAM-UHFFFAOYSA-N CCCCC(CC)COC(=O)c1ccccc1C(=O)OCC(CC)CCCC -6.978908 0.613829 7 G4 390.564 ... 14.0 158.0 1.0 0.0 0.0 1.0 52.60 170.550496 2.690590e+00 530.722810
1368 A-2377 4,7,7-trimethylbicyclo[3.1.1]hept-3-ene InChI=1S/C10H16/c1-7-4-5-8-6-9(7)10(8,2)3/h4,8... GRWFGVWFFZKLTI-UHFFFAOYSA-N CC1=CCC2CC1C2(C)C -3.772570 0.698272 10 G4 136.238 ... 0.0 56.0 0.0 1.0 3.0 3.0 0.00 63.322465 2.296627e+00 186.214991
1857 A-3047 1,2-dichlorobenzene InChI=1S/C6H4Cl2/c7-5-3-1-2-4-6(5)8/h1-4H RFFLAFLAYFXFSW-UHFFFAOYSA-N Clc1ccccc1Cl -3.053386 0.079736 8 G5 147.004 ... 0.0 42.0 1.0 0.0 0.0 1.0 0.00 58.037936 3.134862e+00 162.638339
1983 A-3266 sodium formate InChI=1S/CH2O2.Na/c2-1-3;/h1H,(H,2,3);/q;+1/p-1 HLBBKKJFGFRGMU-UHFFFAOYSA-M [Na+].[O-]C=O 1.012544 0.128539 7 G5 68.007 ... 0.0 18.0 0.0 0.0 0.0 0.0 40.13 46.108407 0.000000e+00 13.509775
1985 A-3273 acid D,L-aspart InChI=1S/C4H7NO4/c5-2(4(8)9)1-3(6)7/h2H,1,5H2,... CKLJMWTZIZZHCS-UHFFFAOYSA-N NC(CC(O)=O)C(O)=O -1.229318 0.359384 8 G5 133.103 ... 3.0 52.0 0.0 0.0 0.0 0.0 100.62 51.085480 3.632432e+00 132.529325
2017 A-3330 iron(3+) chloride sulfate InChI=1S/Cl.Fe.H2O4S/c;;1-5(2,3)4/h;;(H2,1,2,3... NGYBMUURDZXEEA-UHFFFAOYSA-L [Cl].[Fe].[O-][S]([O-])(=O)=O 0.540234 0.378887 8 G5 187.361 ... 0.0 47.0 0.0 0.0 0.0 0.0 80.26 57.754896 -8.000000e-08 94.858202
2018 A-3331 iron(3+) ion trichloride InChI=1S/3ClH.Fe/h3*1H;/q;;;+3/p-3 RBTARNINKXHZNM-UHFFFAOYSA-K [Cl-].[Cl-].[Cl-].[Fe+3] 0.602852 0.378887 8 G5 162.204 ... 0.0 29.0 0.0 0.0 0.0 0.0 0.00 54.423378 -0.000000e+00 3.245112
2019 A-3333 iron(+2) cation sulfate InChI=1S/Fe.H2O4S/c;1-5(2,3)4/h;(H2,1,2,3,4)/q... BAUYGSIQEAFULO-UHFFFAOYSA-L [Fe++].[O-][S]([O-])(=O)=O 0.631333 0.378887 8 G5 151.908 ... 0.0 38.0 0.0 0.0 0.0 0.0 80.26 45.601803 0.000000e+00 90.716493
2089 A-3446 (2R,3R)-2,3-dihydroxybutanedioic acid InChI=1S/C4H6O6/c5-1(3(7)8)2(6)4(9)10/h1-2,5-6... FEWJPZIEWOKRBE-UHFFFAOYSA-N OC(C(O)C(O)=O)C(O)=O 0.571848 0.682833 10 G4 150.086 ... 3.0 58.0 0.0 0.0 0.0 0.0 115.06 55.334057 4.069512e+00 133.826806
2091 A-3448 potassium hydrogen tartarate InChI=1S/C4H6O6.K/c5-1(3(7)8)2(6)4(9)10;/h1-2,... KYKNRZGSIGMXFH-ZVGUSBNCSA-M [K+].O[C@H]([C@@H](O)C([O-])=O)C(O)=O 0.473624 0.753304 8 G4 188.176 ... 3.0 58.0 0.0 0.0 0.0 0.0 117.89 104.256621 0.000000e+00 138.661273
2733 A-4559 3-bromo-1-(3-chloropyridin-2-yl)-1H-pyrazole-5... InChI=1S/C9H5BrClN3O2/c10-7-4-6(9(15)16)14(13-... FORBXGROTPOMEH-UHFFFAOYSA-N OC(=O)c1cc(Br)nn1c2ncccc2Cl -1.448935 1.502581 7 G4 302.515 ... 2.0 82.0 2.0 0.0 0.0 2.0 68.01 103.806891 2.643285e+00 555.437168
2887 A-4830 chlorobenzene InChI=1S/C6H5Cl/c7-6-4-2-1-3-5-6/h1-5H MVPPADPHJFYWMZ-UHFFFAOYSA-N Clc1ccccc1 -2.449320 0.132559 8 G5 112.559 ... 0.0 36.0 1.0 0.0 0.0 1.0 0.00 47.734669 3.021465e+00 134.107370
3002 A-5028 methyl 2-[(4-ethoxy-6-methylamino-1,3,5-triazi... InChI=1S/C15H18N6O6S/c1-4-27-15-19-12(16-2)17-... ZINJLDJMHCUBIP-UHFFFAOYSA-N CCOc1nc(NC)nc(NC(=O)N[S](=O)(=O)c2ccccc2C(=O)O... -4.387911 1.388382 8 G4 410.412 ... 7.0 150.0 2.0 0.0 0.0 2.0 161.50 160.297591 2.235189e+00 980.736675
3387 A-5577 dioxotungsten InChI=1S/2O.W DZKDPOPGYFUOGI-UHFFFAOYSA-N O=[W]=O -5.955730 0.637644 9 G4 215.838 ... 0.0 18.0 0.0 0.0 0.0 0.0 34.14 25.984794 3.265986e+00 23.774438
3393 A-5587 cobalt sulphide InChI=1S/Co.S/q+2;-2 INPLXZPZQSLHBR-UHFFFAOYSA-N [S--].[Co++] -5.015250 1.028126 27 G4 91.000 ... 0.0 15.0 0.0 0.0 0.0 0.0 0.00 30.881366 0.000000e+00 2.000000
3394 A-5589 methane; vanadium InChI=1S/CH4.V/h1H4; GORXZVFEOLUTMI-UHFFFAOYSA-N C.[V] -6.205842 0.982342 8 G4 66.985 ... 0.0 13.0 0.0 0.0 0.0 0.0 0.00 26.747394 0.000000e+00 2.000000
3427 A-5674 Aluminum;phosphenic acid InChI=1S/Al.3H3O3P/c;3*1-4(2)3/h;3*4H,(H2,1,2,... PGOXTIQAMARHNF-UHFFFAOYSA-K [Al+3].O[PH]([O-])=O.O[PH]([O-])=O.O[PH]([O-])=O -4.770881 1.247611 9 G4 269.943 ... 0.0 78.0 0.0 0.0 0.0 0.0 181.08 85.659023 -4.500000e-08 112.409479
3431 A-5680 cobaltoylol InChI=1S/Co.H2O.O/h;1H2;/q+1;;/p-1 DLHSXQSAISCVNN-UHFFFAOYSA-M O[Co]=O -6.275971 0.991201 16 G4 91.939 ... 0.0 22.0 0.0 0.0 0.0 0.0 37.30 23.955846 2.187496e+00 10.264663
3433 A-5686 cobalt InChI=1S/Co GUTLYIVDDKVIGB-UHFFFAOYSA-N [Co] -5.117146 1.364118 38 G4 58.933 ... 0.0 9.0 0.0 0.0 0.0 0.0 0.00 17.688511 0.000000e+00 0.000000
3451 A-5731 cobalt(2+) oxalate InChI=1S/C2H2O4.Co/c3-1(4)2(5)6;/h(H,3,4)(H,5,... MULYSYXKGICWJF-UHFFFAOYSA-L [Co++].[O-]C(=O)C([O-])=O -4.667788 0.754840 13 G4 146.951 ... 0.0 41.0 0.0 0.0 0.0 0.0 80.26 49.335738 0.000000e+00 75.690584
3499 A-5838 Cobaltous carbonate InChI=1S/CH2O3.Co/c2-1(3)4;/h(H2,2,3,4);/q;+2/p-2 ZOTKGJBKKKVBJZ-UHFFFAOYSA-L [Co++].[O-]C([O-])=O -4.837537 1.007624 16 G4 118.941 ... 0.0 31.0 0.0 0.0 0.0 0.0 63.19 38.809274 0.000000e+00 29.874303
3508 A-5852 Cobaltous 2-ethylhexanoate InChI=1S/2C8H16O2.Co/c2*1-3-5-6-7(4-2)8(9)10;/... QAEKNCDIHIGLFI-UHFFFAOYSA-L [Co++].CCCCC(CC)C([O-])=O.CCCCC(CC)C([O-])=O -5.347978 1.946375 12 G4 345.345 ... 10.0 127.0 0.0 0.0 0.0 0.0 80.26 139.450759 -2.945454e-07 230.661531
3523 A-5888 oxocobalt InChI=1S/Co.O IVMYJDGYRUAWML-UHFFFAOYSA-N O=[Co] -5.798391 1.313112 16 G4 74.932 ... 0.0 15.0 0.0 0.0 0.0 0.0 17.07 20.447360 2.000000e+00 2.000000
3755 B-162 methane InChI=1S/CH4/h1H4 VNWKTOKETHGBQD-UHFFFAOYSA-N C -2.862900 1.591275 13 G4 16.043 ... 0.0 8.0 0.0 0.0 0.0 0.0 0.00 8.739251 0.000000e+00 0.000000
4230 B-900 potassium sodium tartrate InChI=1S/C4H6O6.K.Na/c5-1(3(7)8)2(6)4(9)10;;/h... LJCNRYVRMXRIQR-OLXYHTOASA-L [Na+].[K+].O[C@H]([C@@H](O)C([O-])=O)C([O-])=O 0.398400 0.710847 9 G4 210.158 ... 3.0 58.0 0.0 0.0 0.0 0.0 120.72 132.670021 -4.050000e-07 143.627075
9808 G-776 Lindane InChI=1S/C6H6Cl6/c7-1-2(8)4(10)6(12)5(11)3(1)9... JLYXXMFPNIAWKQ-UHFFFAOYSA-N ClC1C(Cl)C(Cl)C(Cl)C(Cl)C1Cl -4.640000 0.523626 8 G4 290.832 ... 0.0 72.0 0.0 1.0 1.0 1.0 0.00 101.377728 2.760252e+00 103.587975

42 rows × 26 columns

In [13]:
plt.figure(figsize = (20,8))
sns.histplot(df['RingCount'], kde=True)
plt.title('RingCount Distribution')
plt.ylabel("Frequency")
plt.show()
In [14]:
plt.figure(figsize = (20,9))
sns.histplot(df['NumAliphaticRings'], kde=True)
plt.title('NumAliphaticRings Distribution')
plt.ylabel("Frequency")
plt.show()
In [15]:
plt.figure(figsize = (20,9))
sns.histplot(df['NumSaturatedRings'], kde=True)
plt.title('NumSaturatedRings Distribution')
plt.ylabel("Frequency")
plt.show()
In [16]:
plt.figure(figsize=(10, 6))
ax = sns.violinplot(x='Group', y='Solubility', data=df)
plt.title('Violin plot of Solubility by Group')
plt.xlabel('Group')
plt.ylabel('Solubility')
plt.xticks(rotation=45)

# Calculate and annotate median, quartiles, min, and max
groups = df['Group'].unique()
for group in groups:
    data = df[df['Group'] == group]['Solubility']
    median = data.median()
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    min_val = data.min()
    max_val = data.max()
    
    pos = groups.tolist().index(group)
    ax.annotate(f'Min: {min_val:.2f}', xy=(pos, min_val), xytext=(pos, min_val - 0.5),
                arrowprops=dict(facecolor='orange', arrowstyle='->'), ha='center')
    ax.annotate(f'Q1: {q1:.2f}', xy=(pos, q1), xytext=(pos, q1 - 0.5),
                arrowprops=dict(facecolor='blue', arrowstyle='->'), ha='center')
    ax.annotate(f'Median: {median:.2f}', xy=(pos, median), xytext=(pos, median + 0.5),
                arrowprops=dict(facecolor='black', arrowstyle='->'), ha='center')
    ax.annotate(f'Q3: {q3:.2f}', xy=(pos, q3), xytext=(pos, q3 + 0.5),
                arrowprops=dict(facecolor='green', arrowstyle='->'), ha='center')
    ax.annotate(f'Max: {max_val:.2f}', xy=(pos, max_val), xytext=(pos, max_val + 0.5),
                arrowprops=dict(facecolor='red', arrowstyle='->'), ha='center')

plt.show()

Let's also consider the violin plot above, which shows the distribution of solubility values across five distinct groups: G1, G3, G5, G4, and G2. Each group exhibits unique characteristics in terms of central tendency and variability, providing valuable insights into the solubility behavior of the compounds.

Group G1: The solubility values for Group G1 are heavily concentrated around the median of -2.59, with the distribution extending significantly towards the lower end. This wide range, especially with values dipping as low as -13.17, indicates the presence of substantial outliers, reflecting high variability within this group.

Group G3: Group G3 demonstrates a similar spread to G1, albeit with a slightly higher minimum value of -12.06, suggesting fewer extreme outliers. The central tendency is marginally higher than that of G1, with a median solubility of -2.49, indicating a slightly less negative solubility overall.

Group G5: The solubility values in Group G5 are more tightly clustered around the median of -2.44 compared to G1 and G3. This tighter clustering, indicated by a narrower interquartile range (IQR) from -3.98 to -0.99, suggests less variability and fewer extreme outliers, reflecting more consistency in solubility within this group.

Group G4: Group G4 exhibits a lower median solubility value of -3.81, indicating that compounds in this group generally have lower solubility. The range of values is more compact, suggesting reduced variability and a more homogeneous distribution of solubility characteristics within this group.

Group G2: Group G2 has the lowest median solubility value among all groups, at -4.43, indicating a general trend towards lower solubility. Despite this, the group shows a wide range with fewer extreme outliers, suggesting that while the compounds tend to have lower solubility, the variability is not as pronounced as in some other groups.

Bivariate Analysis¶

Bivariate analysis involves the simultaneous analysis of two variables to determine the empirical relationship between them. This type of analysis can reveal the strength and direction of associations, using techniques such as correlation coefficients, scatter plots, and cross-tabulations (Cohen, Cohen, West & Aiken, 2013).

In the context of this dataset, bivariate analysis is essential for examining how different molecular descriptors relate to solubility. A scatter plot can show the relationship between molecular weight and solubility, while correlation coefficients quantify the strength and direction of these relationships (Dancey & Reidy, 2011). Identifying strong correlations can help in selecting relevant features for predictive modeling, thereby enhancing the model's accuracy and efficiency. By conducting bivariate analysis, we can uncover key interactions between variables, providing deeper insights into the factors that significantly impact solubility. This is crucial for developing robust predictive models and making informed decisions in chemical compound design and environmental risk assessment.
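As a concrete illustration of what a correlation coefficient measures, the Pearson r used throughout this analysis can be computed from scratch. The sketch below uses small synthetic vectors rather than the actual dataset:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()
    return (xm * ym).sum() / np.sqrt((xm ** 2).sum() * (ym ** 2).sum())

# Perfectly linear, decreasing data gives r = -1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, 8.0, 6.0, 4.0, 2.0])
print(round(pearson_r(x, y), 4))  # -1.0
```

This is the same quantity `df.corr()` reports for each pair of numeric columns.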

In [17]:
# Calculate the correlation matrix
correlation_matrix = df.corr()

# Find pairs of columns with correlation coefficient greater than 0.45
high_correlation_pairs = []
threshold = 0.45
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            high_correlation_pairs.append((correlation_matrix.columns[i], correlation_matrix.columns[j]))

# Plot scatter plots for each pair with high correlation
for pair in high_correlation_pairs:
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x=pair[0], y=pair[1], data=df)
    plt.title(f'Scatter Plot of {pair[0]} vs {pair[1]}\nCorrelation Coefficient: {correlation_matrix.loc[pair[0], pair[1]]:.2f}')
    plt.xlabel(pair[0])
    plt.ylabel(pair[1])
    plt.grid(True)
    plt.show()

    print(f'Scatter plot for {pair[0]} vs {pair[1]} (Correlation Coefficient: {correlation_matrix.loc[pair[0], pair[1]]:.2f})')
Scatter plot for Solubility vs MolLogP (Correlation Coefficient: -0.61)
Scatter plot for SD vs Ocurrences (Correlation Coefficient: 0.49)
Scatter plot for MolWt vs MolMR (Correlation Coefficient: 0.92)
Scatter plot for MolWt vs HeavyAtomCount (Correlation Coefficient: 0.95)
Scatter plot for MolWt vs NumHAcceptors (Correlation Coefficient: 0.73)
Scatter plot for MolWt vs NumHeteroatoms (Correlation Coefficient: 0.78)
Scatter plot for MolWt vs NumRotatableBonds (Correlation Coefficient: 0.61)
Scatter plot for MolWt vs NumValenceElectrons (Correlation Coefficient: 0.95)
Scatter plot for MolWt vs NumAromaticRings (Correlation Coefficient: 0.57)
Scatter plot for MolWt vs RingCount (Correlation Coefficient: 0.62)
Scatter plot for MolWt vs TPSA (Correlation Coefficient: 0.65)
Scatter plot for MolWt vs LabuteASA (Correlation Coefficient: 0.97)
Scatter plot for MolWt vs BertzCT (Correlation Coefficient: 0.86)
Scatter plot for MolLogP vs MolMR (Correlation Coefficient: 0.49)
Scatter plot for MolMR vs HeavyAtomCount (Correlation Coefficient: 0.97)
Scatter plot for MolMR vs NumHAcceptors (Correlation Coefficient: 0.62)
Scatter plot for MolMR vs NumHeteroatoms (Correlation Coefficient: 0.58)
Scatter plot for MolMR vs NumRotatableBonds (Correlation Coefficient: 0.70)
Scatter plot for MolMR vs NumValenceElectrons (Correlation Coefficient: 0.97)
Scatter plot for MolMR vs NumAromaticRings (Correlation Coefficient: 0.61)
Scatter plot for MolMR vs RingCount (Correlation Coefficient: 0.66)
Scatter plot for MolMR vs TPSA (Correlation Coefficient: 0.50)
Scatter plot for MolMR vs LabuteASA (Correlation Coefficient: 0.96)
Scatter plot for MolMR vs BertzCT (Correlation Coefficient: 0.85)
Scatter plot for HeavyAtomCount vs NumHAcceptors (Correlation Coefficient: 0.74)
Scatter plot for HeavyAtomCount vs NumHeteroatoms (Correlation Coefficient: 0.71)
Scatter plot for HeavyAtomCount vs NumRotatableBonds (Correlation Coefficient: 0.66)
Scatter plot for HeavyAtomCount vs NumValenceElectrons (Correlation Coefficient: 0.99)
Scatter plot for HeavyAtomCount vs NumAromaticRings (Correlation Coefficient: 0.62)
Scatter plot for HeavyAtomCount vs RingCount (Correlation Coefficient: 0.69)
Scatter plot for HeavyAtomCount vs TPSA (Correlation Coefficient: 0.64)
Scatter plot for HeavyAtomCount vs LabuteASA (Correlation Coefficient: 0.98)
Scatter plot for HeavyAtomCount vs BertzCT (Correlation Coefficient: 0.90)
Scatter plot for NumHAcceptors vs NumHDonors (Correlation Coefficient: 0.49)
Scatter plot for NumHAcceptors vs NumHeteroatoms (Correlation Coefficient: 0.89)
Scatter plot for NumHAcceptors vs NumValenceElectrons (Correlation Coefficient: 0.72)
Scatter plot for NumHAcceptors vs RingCount (Correlation Coefficient: 0.49)
Scatter plot for NumHAcceptors vs TPSA (Correlation Coefficient: 0.90)
Scatter plot for NumHAcceptors vs LabuteASA (Correlation Coefficient: 0.74)
Scatter plot for NumHAcceptors vs BertzCT (Correlation Coefficient: 0.74)
Scatter plot for NumHDonors vs NumHeteroatoms (Correlation Coefficient: 0.45)
Scatter plot for NumHDonors vs TPSA (Correlation Coefficient: 0.63)
Scatter plot for NumHeteroatoms vs NumValenceElectrons (Correlation Coefficient: 0.69)
Scatter plot for NumHeteroatoms vs TPSA (Correlation Coefficient: 0.89)
Scatter plot for NumHeteroatoms vs LabuteASA (Correlation Coefficient: 0.74)
Scatter plot for NumHeteroatoms vs BertzCT (Correlation Coefficient: 0.73)
Scatter plot for NumRotatableBonds vs NumValenceElectrons (Correlation Coefficient: 0.71)
Scatter plot for NumRotatableBonds vs LabuteASA (Correlation Coefficient: 0.65)
Scatter plot for NumValenceElectrons vs NumAromaticRings (Correlation Coefficient: 0.55)
Scatter plot for NumValenceElectrons vs RingCount (Correlation Coefficient: 0.63)
Scatter plot for NumValenceElectrons vs TPSA (Correlation Coefficient: 0.62)
Scatter plot for NumValenceElectrons vs LabuteASA (Correlation Coefficient: 0.97)
Scatter plot for NumValenceElectrons vs BertzCT (Correlation Coefficient: 0.85)
Scatter plot for NumAromaticRings vs RingCount (Correlation Coefficient: 0.77)
Scatter plot for NumAromaticRings vs LabuteASA (Correlation Coefficient: 0.60)
Scatter plot for NumAromaticRings vs BertzCT (Correlation Coefficient: 0.82)
Scatter plot for NumSaturatedRings vs NumAliphaticRings (Correlation Coefficient: 0.90)
Scatter plot for NumSaturatedRings vs RingCount (Correlation Coefficient: 0.48)
Scatter plot for NumAliphaticRings vs RingCount (Correlation Coefficient: 0.61)
Scatter plot for RingCount vs LabuteASA (Correlation Coefficient: 0.65)
Scatter plot for RingCount vs BertzCT (Correlation Coefficient: 0.80)
Scatter plot for TPSA vs LabuteASA (Correlation Coefficient: 0.65)
Scatter plot for TPSA vs BertzCT (Correlation Coefficient: 0.65)
Scatter plot for LabuteASA vs BertzCT (Correlation Coefficient: 0.89)

The bivariate analysis conducted through scatter plots provides a comprehensive understanding of the relationships between solubility and various molecular properties, as well as among the properties themselves. The analysis highlights both the strength and direction of these relationships through the correlation coefficients, offering valuable insights into the chemical behavior of the compounds under study.

The relationship between solubility and molecular weight (MolWt) reveals a moderate negative correlation. This indicates that as the molecular weight of a compound increases, its solubility tends to decrease. Heavier molecules generally exhibit lower solubility, which can be attributed to the larger size and increased complexity that hinder their dissolution in solvents.

A stronger negative correlation is observed between solubility and the log of the partition coefficient (MolLogP). This relationship suggests that as compounds become more lipophilic (higher MolLogP values), their solubility in aqueous environments decreases significantly. This trend aligns with the chemical understanding that lipophilic compounds are less soluble in water.

The scatter plot of solubility versus molecular refractivity (MolMR) shows a moderate negative correlation. Higher refractivity values, indicative of greater interaction potentials with solvents, are associated with lower solubility. This suggests that compounds with higher refractivity are less likely to dissolve easily.

The analysis of solubility versus heavy atom count reveals a moderate negative correlation. Compounds with more heavy atoms are generally less soluble, indicating that increased atomic complexity may reduce solubility. Similarly, the scatter plot of solubility versus the number of valence electrons also shows a moderate negative correlation, implying that compounds with more valence electrons tend to have lower solubility.

The relationship between solubility and the approximate surface area (Labute ASA) exhibits a moderate negative correlation. Larger surface areas are associated with lower solubility, suggesting that molecules with more exposed surface area are less likely to dissolve. The scatter plot of solubility versus BertzCT, a measure of molecular complexity, shows a moderate negative correlation, indicating that more complex molecules tend to be less soluble.

Moving to the relationships among the descriptors themselves, a moderate positive correlation of 0.49 is observed between MolLogP and molecular refractivity (MolMR), consistent with more lipophilic molecules tending to be bulkier. The scatter plot of molecular weight versus MolMR shows a strong positive correlation (0.92), indicating that heavier molecules have higher refractivity, reflecting increased interaction potentials.

The analysis of molecular weight versus heavy atom count reveals a strong positive correlation. This suggests that as molecular weight increases, the number of heavy atoms also increases, indicating that larger molecules comprise more atoms. Similarly, the scatter plot of molecular weight versus the number of valence electrons shows a strong positive correlation indicating that heavier compounds tend to have more valence electrons.

The relationship between molecular weight and Labute ASA is characterized by a very strong positive correlation, indicating that as molecular weight increases, the accessible surface area increases significantly. The scatter plot of molecular weight versus BertzCT shows a strong positive correlation suggesting that heavier compounds are generally more complex.

Further analysis reveals strong positive correlations between molecular refractivity and other properties. The scatter plot of molecular refractivity versus heavy atom count shows a very strong positive correlation, indicating that compounds with higher refractivity tend to have more heavy atoms. The relationship between molecular refractivity and the number of valence electrons also shows a very strong positive correlation, suggesting that compounds with higher refractivity generally have more valence electrons.

The scatter plot of molecular refractivity versus Labute ASA exhibits a very strong positive correlation, indicating that as refractivity increases, so does the accessible surface area. The plot of molecular refractivity versus BertzCT shows a strong positive correlation, suggesting that more refractive compounds are generally more complex.

Examining the heavy atom count against other properties, the scatter plot of heavy atom count versus the number of valence electrons shows an almost perfect positive correlation, indicating that compounds with more heavy atoms tend to have more valence electrons. The plot of heavy atom count versus Labute ASA reveals a very strong positive correlation, suggesting that compounds with more heavy atoms have larger accessible surface areas. The scatter plot of heavy atom count versus BertzCT shows a strong positive correlation, indicating that compounds with more heavy atoms are generally more complex.

The relationships involving the number of valence electrons demonstrate significant positive correlations. The scatter plot of the number of valence electrons versus Labute ASA shows a very strong positive correlation, indicating that compounds with more valence electrons tend to have larger accessible surface areas. The plot of the number of valence electrons versus BertzCT shows a strong positive correlation, suggesting that compounds with more valence electrons tend to be more complex.

Labute ASA versus BertzCT shows a strong positive correlation, suggesting that compounds with larger accessible surface areas tend to be more complex. This comprehensive bivariate analysis highlights the intricate relationships among various chemical properties, offering valuable insights into the factors influencing solubility and molecular complexity.

Multivariate Analysis¶

Multivariate analysis involves examining more than two variables simultaneously to understand the complex relationships and interactions among them (Tabachnick & Fidell, 2013). Techniques such as multiple regression, principal component analysis (PCA), and cluster analysis are commonly used in multivariate analysis. These methods can identify patterns and dependencies that are not evident in univariate or bivariate analyses.

In [18]:
# Correlation matrix of the numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns
plt.figure(figsize=(20, 8))
sns.heatmap(df[numerical_cols].corr(), annot=True, fmt=".2f", cmap='coolwarm')
plt.show()
plt.show()

The correlation matrix provides a comprehensive overview of the relationships between the numerical molecular properties and solubility. Each cell in the matrix represents the correlation coefficient between two properties, with values ranging from -1 to 1. The color gradient visually depicts the strength and direction of these correlations, with red indicating positive correlations and blue indicating negative correlations.

Principal Component Analysis¶

Principal Component Analysis (PCA) is a vital dimensionality reduction technique used to transform a large set of variables into a smaller, more manageable set that retains most of the original information. PCA identifies the principal components, the directions in which the data exhibits the most variation, and projects the data onto them. This process is particularly valuable for simplifying datasets with many interrelated variables, enhancing computational efficiency, and minimizing noise.
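As a minimal sketch of what PCA does under the hood (on synthetic data, not this dataset), the principal components are the eigenvectors of the covariance matrix of the standardized data, ordered by eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))           # toy data: 200 samples, 4 features

# 1. Centre and scale each feature, as StandardScaler does
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Eigendecompose the covariance matrix
cov = np.cov(Xs, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]       # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Project onto the top-2 directions (the principal components)
pcs = Xs @ eigvecs[:, :2]

# Explained variance ratio of each retained component
print(eigvals[:2] / eigvals.sum())
```

`sklearn.decomposition.PCA` performs an equivalent computation (via SVD) and exposes the same ratios as `explained_variance_ratio_`.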

In [19]:
# Select numerical columns for PCA
numerical_cols = df.select_dtypes(include=[np.number]).columns
data = df[numerical_cols].dropna()

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Apply PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_data)

# Create a DataFrame with the principal components
pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])

# Plot the principal components
plt.figure(figsize=(10, 6))
plt.scatter(pca_df['PC1'], pca_df['PC2'], alpha=0.7)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Dataset')
plt.grid(True)
plt.show()

# Print the explained variance ratio
print(f'Explained variance ratio: {pca.explained_variance_ratio_}')
Explained variance ratio: [0.46329573 0.13466843]
In [20]:
pca_df
Out[20]:
PC1 PC2
0 0.427680 1.811586
1 -0.863636 0.471152
2 -2.213924 0.425315
3 7.877660 2.215719
4 3.949742 0.375230
... ... ...
9977 0.033225 0.504861
9978 4.156564 -2.543008
9979 -1.962507 0.533362
9980 2.968548 1.681064
9981 1.319624 1.291313

9982 rows × 2 columns

For a high-dimensional dataset such as this one, PCA effectively reduces the number of dimensions without significant information loss. It does so by transforming the original variables into a new set of uncorrelated variables, ordered by the amount of variance they capture. The first principal component captures the most variance, followed by the second, and so forth. This dimensionality reduction facilitates easier visualization and analysis while preserving critical information.

The scatter plot depicting the first two principal components (PC1 and PC2) of the dataset reveals substantial insights, with these components together accounting for approximately 59.8% of the total variance (46.3% by PC1 and 13.5% by PC2). The plot shows a dense cluster of points near the origin, indicating that the bulk of the data points take similar values along these components and therefore share similarities in terms of the original variables. Several outliers are scattered across the plot; those distant from the origin exhibit significant variance along the first two principal components, indicating unique or anomalous characteristics that distinguish them from the rest of the dataset.

The distribution spread along the PC1 and PC2 axes indicates that the data exhibits more variance captured by PC1 than by PC2, as reflected in the explained variance ratio. This suggests that the first principal component plays a more crucial role in differentiating the data points.
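If retaining a fixed fraction of variance were preferred over a fixed number of components, the cumulative explained variance ratio could guide the choice. The sketch below uses synthetic data with deliberately redundant features rather than the actual dataset:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10))
X[:, 5:] = X[:, :5] + 0.1 * rng.normal(size=(500, 5))  # 5 near-duplicate features

scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(scaled)                  # keep all components
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components covering 90% of the variance
n_keep = int(np.searchsorted(cum_var, 0.90) + 1)
print(n_keep, cum_var[:n_keep].round(3))
```

Because each of the last five features nearly duplicates one of the first five, roughly half the components suffice to cover 90% of the variance.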

Cluster Analysis¶

Cluster analysis is a statistical technique used to group a set of objects into clusters so that objects within a cluster are more similar to each other than to those in other clusters (Kaufman & Rousseeuw, 2009). It helps in identifying patterns and structures in the data that may not be apparent through univariate or bivariate analysis. By applying cluster analysis, researchers can gain insights into the inherent grouping of chemical compounds, facilitating the identification of patterns that may influence solubility. This knowledge is valuable for predictive modeling, as it allows for the development of more targeted and effective models, enhancing the understanding and design of chemical compounds.

In [21]:
# Select numerical columns for clustering
numerical_cols = df.select_dtypes(include=[np.number]).columns
data = df[numerical_cols].dropna()

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Determine the number of clusters using the Elbow method
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(scaled_data)
    wcss.append(kmeans.inertia_)

# Plot the Elbow method
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.grid(True)
plt.show()

# Apply K-means clustering with the chosen number of clusters (e.g., 3)
n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(scaled_data)

# Add the cluster labels to the original dataframe
df['Cluster'] = cluster_labels

# Apply PCA to reduce dimensions for visualization
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_data)
pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
pca_df['Cluster'] = cluster_labels

# Plot the clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(x='PC1', y='PC2', hue='Cluster', data=pca_df, palette='viridis', alpha=0.7)
plt.title('K-means Clustering with 2 Principal Components')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Cluster')
plt.grid(True)
plt.show()
In [22]:
from sklearn.metrics import silhouette_score

# Calculate the silhouette score
sil_score = silhouette_score(scaled_data, cluster_labels)
print(f'Silhouette Score: {sil_score:.2f}')
Silhouette Score: 0.24

For further cluster analysis, I use the Elbow method, a common technique for determining the optimal number of clusters for K-means clustering. By plotting the within-cluster sum of squares (WCSS) against the number of clusters, the "elbow" point where the WCSS starts to decrease more slowly indicates the optimal number of clusters. In this analysis, the elbow point suggests that three clusters may be appropriate.
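The elbow can also be located programmatically. One common heuristic, sketched below on synthetic blobs rather than the actual dataset, picks the k whose point on the WCSS curve lies furthest from the chord joining the curve's endpoints:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated synthetic blobs stand in for scaled_data
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.66]])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.6,
                  random_state=42)

k_values = np.arange(1, 11)
wcss = np.array([KMeans(n_clusters=k, n_init=10, random_state=42)
                 .fit(X).inertia_ for k in k_values])

# Normalise both axes, then find the point furthest from the chord
# joining the first and last points of the (k, WCSS) curve
kn = (k_values - k_values.min()) / (k_values.max() - k_values.min())
wn = (wcss - wcss.min()) / (wcss.max() - wcss.min())
x1, y1, x2, y2 = kn[0], wn[0], kn[-1], wn[-1]
dist = np.abs((y2 - y1) * kn - (x2 - x1) * wn + x2 * y1 - y2 * x1) \
       / np.hypot(y2 - y1, x2 - x1)
elbow_k = int(k_values[np.argmax(dist)])
print(elbow_k)  # 3 for this synthetic data
```

This "distance to chord" rule is just one heuristic; visually inspecting the WCSS plot, as done above, remains a reasonable complement.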

The K-means clustering scatter plot, using the first two principal components, shows how the dataset is partitioned into three clusters. The clusters are distinct but overlap somewhat: the central cluster (Cluster 1) appears densely packed, while Clusters 0 and 2 are more spread out. The silhouette score, which measures the quality of the clustering, is 0.24, indicating only moderate quality: while some points are well-clustered, there is substantial overlap or ambiguity between clusters, leaving room for clearer separation.

In [23]:
silhouette_scores = []

# Try different numbers of clusters
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    cluster_labels = kmeans.fit_predict(scaled_data)
    sil_score = silhouette_score(scaled_data, cluster_labels)
    silhouette_scores.append((k, sil_score))

# Find the best number of clusters
best_k = max(silhouette_scores, key=lambda x: x[1])[0]
print(f'Best number of clusters: {best_k}')

# Plot silhouette scores
ks, scores = zip(*silhouette_scores)
plt.figure(figsize=(10, 6))
plt.plot(ks, scores, marker='o')
plt.title('Silhouette Scores for Different Numbers of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.grid(True)
plt.show()
Best number of clusters: 2
In [24]:
# Select numerical columns for clustering
numerical_cols = df.select_dtypes(include=[np.number]).columns
data = df[numerical_cols].dropna()

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Apply K-means clustering with the chosen number of clusters (2)
n_clusters = 2
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(scaled_data)

# Add the cluster labels to the original dataframe
df['Cluster'] = cluster_labels

# Apply PCA to reduce dimensions for visualization
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_data)
pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
pca_df['Cluster'] = cluster_labels

# Plot the clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(x='PC1', y='PC2', hue='Cluster', data=pca_df, palette='viridis', alpha=0.7)
plt.title('K-means Clustering with 2 Principal Components')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Cluster')
plt.grid(True)
plt.show()

# Calculate and print the silhouette score
from sklearn.metrics import silhouette_score
sil_score = silhouette_score(scaled_data, cluster_labels)
print(f'Silhouette Score with 2 clusters: {sil_score:.2f}')
Silhouette Score with 2 clusters: 0.58

Further analysis with different numbers of clusters reveals that the best number of clusters is 2, as indicated by the highest silhouette score of 0.58. The silhouette score measures how similar an object is to its own cluster compared to other clusters, with higher scores indicating better-defined clusters. Using 2 clusters therefore provides the most distinct separation between groups.
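For reference, the silhouette coefficient of a sample is s = (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b its mean distance to the nearest other cluster; the overall score is the mean of s over all samples. A hand-rolled check against scikit-learn on a toy 1-D example (not this dataset):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight, well-separated clusters: silhouette should be close to 1
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_by_hand(X, labels):
    d = np.abs(X - X.T)                       # pairwise distances (1-D data)
    scores = []
    for i in range(len(X)):
        same = (labels == labels[i])
        a = d[i, same].sum() / (same.sum() - 1)   # exclude the point itself
        b = min(d[i, labels == c].mean()
                for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

print(round(silhouette_by_hand(X, labels), 4))
print(round(silhouette_score(X, labels), 4))  # the two values match
```

A score near 1 means tight, well-separated clusters; near 0 means overlapping clusters, which is why the 0.24 obtained with three clusters above signals ambiguity.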

In [25]:
# Add the cluster labels to the original dataframe
df['Cluster'] = cluster_labels

# Calculate mean and standard deviation of each feature within each cluster
cluster_profile = df.groupby('Cluster').agg(['mean', 'std'])
print(cluster_profile)

# Plot distributions of key features for each cluster
key_features = list(numerical_cols)
for feature in key_features:
    plt.figure(figsize=(10, 6))
    sns.histplot(data=df, x=feature, hue='Cluster', kde=True, element='step')
    plt.title(f'Distribution of {feature} by Cluster')
    plt.xlabel(feature)
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.show()
C:\Users\PC\AppData\Local\Temp\ipykernel_20600\2471440816.py:5: FutureWarning: ['ID', 'Name', 'InChI', 'InChIKey', 'SMILES', 'Group'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
  cluster_profile = df.groupby('Cluster').agg(['mean', 'std'])
        Solubility                  SD           Ocurrences            \
              mean       std      mean       std       mean       std   
Cluster                                                                 
0        -2.812590  2.310309  0.067663  0.233803   1.389298  1.038462   
1        -4.059434  2.873393  0.064203  0.248083   1.208401  0.741856   

              MolWt               MolLogP            ... RingCount            \
               mean         std      mean       std  ...      mean       std   
Cluster                                              ...                       
0        234.638873  102.527906  1.912118  2.739906  ...  1.318167  1.235001   
1        751.107756  371.492830  2.993354  9.221262  ...  4.507270  3.317775   

               TPSA               LabuteASA              BalabanJ            \
               mean         std        mean         std      mean       std   
Cluster                                                                       
0         53.358059   40.312605   95.191858   40.325074  2.484675  1.019385   
1        200.113489  141.277100  316.452448  154.260009  0.993415  1.181835   

             BertzCT               
                mean          std  
Cluster                            
0         376.921924   276.188422  
1        1834.952790  1293.466717  

[2 rows x 40 columns]

For solubility, Cluster 0 (blue) exhibits a wider range and higher frequency of solubility values compared to Cluster 1 (orange), suggesting that most compounds in Cluster 0 are more soluble. The molecular weight distribution in Cluster 0 is narrower with lower values, whereas Cluster 1 contains compounds with higher molecular weights. In terms of MolLogP, Cluster 0 has a narrower range centered around lower values, while Cluster 1 shows a broader distribution with higher values, indicating more lipophilic compounds. The distribution of molecular refractivity (MolMR) mirrors that of molecular weight (MolWt), with Cluster 0 displaying lower refractivity and Cluster 1 showing higher refractivity.

Cluster 0 contains compounds with fewer heavy atoms compared to Cluster 1. Similarly, Cluster 0 has fewer valence electrons compared to Cluster 1, following the pattern observed with heavy atom count. In terms of topological polar surface area (TPSA), Cluster 0 shows lower values, while Cluster 1 has significantly higher TPSA values. The approximate surface areas (LabuteASA) are lower in Cluster 0, whereas Cluster 1 has larger surface areas.

Cluster 0 exhibits higher BalabanJ values on average than Cluster 1, consistent with its smaller, less branched molecules. Lastly, Cluster 0 has lower BertzCT values, whereas Cluster 1 has higher values, signifying that Cluster 1 contains the more complex molecules.

In [26]:
# Descriptive statistics per cluster, restricted to the numeric feature
# columns so that identifier columns (ID, Name, SMILES, ...) are not
# passed to the aggregation
stats_cols = [col for col in numerical_cols if col != 'Cluster']
cluster_descriptive_stats = df.groupby('Cluster')[stats_cols].agg(['mean', 'median', 'std', 'min', 'max'])
print(cluster_descriptive_stats)
        Solubility                                                 SD         \
              mean    median       std        min       max      mean median   
Cluster                                                                        
0        -2.812590 -2.573579  2.310309 -13.171900  2.137682  0.067663    0.0   
1        -4.059434 -3.931000  2.873393 -11.998938  0.316973  0.064203    0.0   

                                  ...  BalabanJ                                \
              std  min       max  ...      mean    median       std       min   
Cluster                           ...                                           
0        0.233803  0.0  3.870145  ...  2.484675  2.595083  1.019385 -0.000004   
1        0.248083  0.0  2.191675  ...  0.993415  1.022751  1.181835 -0.000004   

                       BertzCT                                       \
              max         mean       median          std        min   
Cluster                                                               
0        7.517310   376.921924   325.349050   276.188422   0.000000   
1        7.372991  1834.952790  1552.141814  1293.466717  36.269827   

                       
                  max  
Cluster                
0         1517.612204  
1        20720.267708  

[2 rows x 100 columns]

Statistical Analysis¶

The t-test is a statistical test used to compare the means of two groups to determine if they are significantly different from each other. In this context, we are using the t-test to compare the means of various features between two clusters identified in the dataset. By performing this test, we can determine if the differences in the feature distributions between Cluster 0 and Cluster 1 are statistically significant.

The hypotheses for the t-tests are as follows:

\begin{align*}
H_0 &: \mu_{0} = \mu_{1} \quad \text{(there is no significant difference in the mean of the feature between Cluster 0 and Cluster 1)} \\
H_1 &: \mu_{0} \neq \mu_{1} \quad \text{(there is a significant difference in the mean of the feature between Cluster 0 and Cluster 1)}
\end{align*}
where $\mu_{0}$ is the mean of the feature for Cluster 0 and $\mu_{1}$ is the mean of the feature for Cluster 1.
In [27]:
from scipy.stats import ttest_ind

# Perform Welch's t-tests for each feature, excluding the Cluster label
# itself (it is constant within each group, which yields a degenerate test)
features = [col for col in numerical_cols if col != 'Cluster']
t_test_results = {}
for feature in features:
    cluster0_data = df[df['Cluster'] == 0][feature]
    cluster1_data = df[df['Cluster'] == 1][feature]
    t_stat, p_value = ttest_ind(cluster0_data, cluster1_data, equal_var=False)
    t_test_results[feature] = (t_stat, p_value)

# Display t-test results
for feature, (t_stat, p_value) in t_test_results.items():
    print(f"{feature}: t-statistic = {t_stat:.2f}, p-value = {p_value:.4f}")
Solubility: t-statistic = 10.57, p-value = 0.0000
SD: t-statistic = 0.34, p-value = 0.7360
Ocurrences: t-statistic = 5.71, p-value = 0.0000
MolWt: t-statistic = -34.50, p-value = 0.0000
MolLogP: t-statistic = -2.91, p-value = 0.0038
MolMR: t-statistic = -32.93, p-value = 0.0000
HeavyAtomCount: t-statistic = -35.77, p-value = 0.0000
NumHAcceptors: t-statistic = -26.52, p-value = 0.0000
NumHDonors: t-statistic = -16.55, p-value = 0.0000
NumHeteroatoms: t-statistic = -25.50, p-value = 0.0000
NumRotatableBonds: t-statistic = -17.97, p-value = 0.0000
NumValenceElectrons: t-statistic = -34.57, p-value = 0.0000
NumAromaticRings: t-statistic = -20.21, p-value = 0.0000
NumSaturatedRings: t-statistic = -4.54, p-value = 0.0000
NumAliphaticRings: t-statistic = -9.12, p-value = 0.0000
RingCount: t-statistic = -23.81, p-value = 0.0000
TPSA: t-statistic = -25.78, p-value = 0.0000
LabuteASA: t-statistic = -35.61, p-value = 0.0000
BalabanJ: t-statistic = 30.65, p-value = 0.0000
BertzCT: t-statistic = -28.00, p-value = 0.0000

The analysis reveals significant differences in molecular weight, with Cluster 1 having higher molecular weights compared to Cluster 0. Additionally, there is a significant difference in MolLogP values, indicating varying lipophilicity of the compounds between the clusters. Molecular refractivity also shows a significant difference, with Cluster 1 displaying higher refractivity.

The number of heavy atoms is significantly different between the clusters, with Cluster 1 having more heavy atoms. Similarly, there are significant differences in the number of hydrogen acceptors and donors, with Cluster 1 exhibiting higher values for both. The number of heteroatoms, rotatable bonds, and valence electrons also differ significantly, with higher values observed in Cluster 1.

The analysis shows that the number of aromatic, saturated, and aliphatic rings varies significantly between the clusters, with Cluster 1 having higher counts in each case. Ring count, topological polar surface area (TPSA), and approximate surface area (LabuteASA) also exhibit significant differences, with Cluster 1 having larger values for each feature.

In terms of molecular complexity, Cluster 1 demonstrates significantly higher BertzCT values than Cluster 0, while its BalabanJ values are significantly lower (the positive t-statistic for BalabanJ reflects the higher mean in Cluster 0). These differences highlight the distinct characteristics of the clusters, with Cluster 1 generally comprising larger, more complex molecules across multiple dimensions.

Based on the results of the t-tests, we can reject the null hypothesis (H0) for almost all the features analyzed. The p-values are significantly low (close to 0) for most features, indicating that there are significant differences in the means of these features between Cluster 0 and Cluster 1. The statistical analysis confirms that there are substantial and significant differences between the two clusters in terms of solubility, molecular weight, MolLogP, molecular refractivity, heavy atom count, number of hydrogen acceptors and donors, number of heteroatoms, number of rotatable bonds, number of valence electrons, number of aromatic, saturated, and aliphatic rings, ring count, TPSA, LabuteASA, BalabanJ, and BertzCT. These findings indicate that the clusters represent distinct groups with differing chemical and physical properties.
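Since roughly twenty t-tests are run simultaneously, the chance of at least one false positive is inflated, and a multiple-comparison adjustment makes the claim of significance more defensible. A minimal Bonferroni sketch (the p-values below are illustrative placeholders, not the notebook's full output):

```python
import numpy as np

# Illustrative p-values standing in for a few of the per-feature t-tests
p_values = np.array([0.0000, 0.7360, 0.0000, 0.0038, 0.0000])
alpha = 0.05

# Bonferroni: divide the significance level by the number of tests
bonferroni_alpha = alpha / len(p_values)
significant = p_values < bonferroni_alpha

print(bonferroni_alpha)        # 0.01
print(significant.tolist())    # [True, False, True, True, True]
```

Given how small most of the reported p-values are, the conclusions above would survive such a correction, but reporting it rules out the multiple-testing objection.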

In [28]:
# Correlation matrices for each cluster
corr_cluster0 = df[df['Cluster'] == 0][numerical_cols].corr()
corr_cluster1 = df[df['Cluster'] == 1][numerical_cols].corr()

# Plot correlation matrices
import seaborn as sns
plt.figure(figsize=(20, 10))
sns.heatmap(corr_cluster0, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix for Cluster 0')
plt.show()

plt.figure(figsize=(20, 10))
sns.heatmap(corr_cluster1, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix for Cluster 1')
plt.show()
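The heatmaps are easiest to scan visually, but strongly correlated descriptor pairs (for example MolWt versus MolMR) can also be listed directly. A sketch on toy data; `strong_pairs` and the demo frame are illustrative helpers, not part of the notebook:

```python
import pandas as pd

# Hypothetical helper: list descriptor pairs whose absolute correlation
# meets a threshold, useful for spotting redundant features
def strong_pairs(corr, threshold=0.9):
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) >= threshold:
                pairs.append((cols[i], cols[j], corr.iloc[i, j]))
    return pairs

# Toy frame: MolWt and MolMR are perfectly linearly related here
demo = pd.DataFrame({'MolWt': [100, 200, 300, 400],
                     'MolMR': [30, 60, 90, 120],
                     'BalabanJ': [2.5, 1.0, 3.0, 0.5]})
print(strong_pairs(demo.corr()))
```

Such a listing is a useful companion to the heatmaps when deciding whether highly collinear descriptors should be pruned before modelling.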

Model Building¶

Model building is a crucial step in data analysis, involving the selection and application of statistical or machine learning algorithms to develop predictive models based on the dataset. In this analysis, linear regression, random forest, and decision tree models were utilized to predict the solubility of chemical compounds based on various molecular descriptors. Using these models is essential as they provide different perspectives and strengths for analyzing the dataset. Linear regression offers a straightforward approach, while decision trees and random forests handle non-linear relationships and interactions among molecular descriptors. By comparing the performance of these models, one can select the most effective approach for accurate and reliable solubility prediction.

In [29]:
# Preliminary feature/target split. Note that at this point the feature
# matrix still contains the non-numeric identifier columns; the split is
# redone below after one-hot encoding.
features = df.drop(columns=['Solubility', 'Cluster'])  # Exclude target and cluster label
target = df['Solubility']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)
In [30]:
# Check for non-numeric columns
non_numeric_cols = df.select_dtypes(exclude=[np.number]).columns
print("Non-numeric columns:", non_numeric_cols)
Non-numeric columns: Index(['ID', 'Name', 'InChI', 'InChIKey', 'SMILES', 'Group'], dtype='object')
In [31]:
# One-hot encode the non-numeric columns with pandas (drop_first=True avoids
# perfectly collinear dummy columns). Note that columns such as Name, InChI
# and SMILES are near-unique identifiers, so encoding them produces a very
# wide, sparse matrix; dropping them instead is often preferable.
df = pd.get_dummies(df, columns=non_numeric_cols, drop_first=True)
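As a quick illustration of what `drop_first=True` does (toy data, not the notebook's dataframe): with k categories only k-1 dummy columns are kept, which avoids perfect collinearity among the dummies.

```python
import pandas as pd

# Toy column with three categories; drop_first=True drops the first
# (alphabetically sorted) category, here 'G1'
toy = pd.DataFrame({'Group': ['G1', 'G2', 'G3', 'G1']})
encoded = pd.get_dummies(toy, columns=['Group'], drop_first=True)
print(encoded.columns.tolist())   # ['Group_G2', 'Group_G3']
```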
In [32]:
# Select features and target
features = df.drop(columns=['Solubility', 'Cluster'])  # Exclude target and cluster label
target = df['Solubility']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Initialize the models
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(random_state=42)
}

# Train and evaluate each model
results = {}
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred)
    rmse = mean_squared_error(y_test, y_pred, squared=False)
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    
    results[model_name] = {'MSE': mse, 'RMSE': rmse, 'R-squared': r2, 'MAE': mae}
In [33]:
# Display the results
results_df = pd.DataFrame(results).T
print(results_df)
                        MSE      RMSE  R-squared       MAE
Linear Regression  2.715572  1.647899   0.499401  1.205257
Decision Tree      2.043242  1.429420   0.623341  0.959217
Random Forest      1.226417  1.107437   0.773918  0.756546

The provided results summarize the performance of three different regression models: Linear Regression, Decision Tree, and Random Forest. The metrics used to evaluate the models are Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared (R²), and Mean Absolute Error (MAE). Each of these metrics provides insights into the accuracy and effectiveness of the models in predicting solubility based on the features in the dataset.
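The four metrics can be made concrete with a small worked example (the numbers below are made up for illustration, in log-solubility units):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Toy true and predicted log-solubility values
y_true = np.array([-2.0, -3.0, -4.0, -1.0])
y_pred = np.array([-2.5, -2.5, -4.5, -1.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(mse, mae, r2)   # 0.1875 0.375 0.85
```

RMSE and MAE are in the target's units and so are the easiest to interpret; R-squared is unitless and expresses the fraction of variance explained.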

The Linear Regression model has an MSE of 2.716, RMSE of 1.648, R-squared of 0.499, and MAE of 1.205. These values indicate that while the model captures some variance in the target variable, its performance is relatively moderate. The R-squared value of 0.499 suggests that approximately 50% of the variability in solubility can be explained by the features used in the model.

The Decision Tree model performs better than the Linear Regression model, with an MSE of 2.043, RMSE of 1.429, R-squared of 0.623, and MAE of 0.959. The improvement in these metrics, particularly the higher R-squared value of 0.623, indicates that the Decision Tree model explains about 62% of the variance in solubility. This improvement suggests that the Decision Tree model can capture more complex patterns in the data compared to Linear Regression.

The Random Forest model shows the best performance among the three models, with an MSE of 1.226, RMSE of 1.107, R-squared of 0.774, and MAE of 0.757. The low MSE and RMSE values indicate that the model's predictions are closer to the actual solubility values. The R-squared value of 0.774 means that the Random Forest model explains approximately 77% of the variance in solubility, highlighting its effectiveness in capturing the underlying relationships in the dataset. The lower MAE value further confirms the model's accuracy in predicting solubility.

From the analysis, it is clear that the Random Forest model outperforms both the Linear Regression and Decision Tree models in predicting solubility. The significant improvement in all the evaluation metrics for the Random Forest model suggests that it is the most suitable model for this particular dataset. The Random Forest model's ability to handle non-linear relationships and interactions between features makes it a robust choice for this regression task.
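One caveat is that these scores come from a single 80/20 split. A k-fold cross-validation sketch, run here on synthetic data from `make_regression` since the notebook's `df` is not reproduced, shows how a more stable estimate could be obtained:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for the descriptor matrix
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=42)

model = RandomForestRegressor(n_estimators=50, random_state=42)

# 5-fold cross-validated R-squared: mean +/- std is a steadier summary
# than a single hold-out score
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(round(scores.mean(), 3), round(scores.std(), 3))
```

Applying the same procedure to the three models above would confirm whether the Random Forest's advantage holds across folds rather than on one particular split.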

Conclusion¶

The comprehensive analysis conducted on the dataset has provided valuable insights into the solubility and various molecular properties of compounds, as well as the performance of different predictive models. The EDA revealed significant variability in the physical and chemical properties of the compounds. The distribution analysis highlighted the diversity within the dataset, with solubility values mostly concentrated around -2 to 0, molecular weights exhibiting a right-skewed distribution, and other properties such as MolLogP and MolMR showing diverse patterns. The box plots provided a detailed summary of the data's distribution, highlighting the median, quartiles, and potential outliers for each feature.

The bivariate analysis demonstrated significant relationships between solubility and various molecular properties: a moderate negative correlation between solubility and molecular weight, MolMR, and heavy atom count; a strong negative correlation between solubility and MolLogP; and moderate to strong positive correlations between molecular weight and related properties such as MolLogP, MolMR, and heavy atom count.

Cluster analysis using K-means then revealed two distinct clusters with significant differences in solubility and molecular properties. Cluster 0 comprised more soluble compounds with lower molecular weights, fewer heavy atoms, and lower refractivity values. Cluster 1 contained less soluble compounds with higher molecular weights, more heavy atoms, and higher refractivity values.

PCA helped reduce the dimensionality of the dataset, facilitating the visualization of clusters and highlighting the variance captured by the first two principal components. The silhouette scores supported the selection of the optimal number of clusters.

For the statistical analysis, the t-tests conducted on the features between the two clusters indicated significant differences in the mean values of most features, leading to the rejection of the null hypothesis in almost all cases. This suggests that the two clusters have distinct characteristics across various molecular properties.

Three regression models were evaluated for predicting solubility: Linear Regression, Decision Tree, and Random Forest. Among these, the Random Forest model performed the best, achieving the lowest MSE and RMSE, the highest R-squared value, and the lowest MAE. This indicates that Random Forest is the most effective model for predicting solubility in this dataset.

The analysis has provided a deep understanding of the relationships between solubility and molecular properties, the distinct clusters within the dataset, and the effectiveness of various predictive models. The findings highlight the complexity and diversity of the compounds and underscore the importance of using advanced techniques like PCA and Random Forest for effective analysis and prediction. The study concludes that the Random Forest model is the most suitable for predicting solubility, given its superior performance across all evaluated metrics.

References¶

  • Breiman, L., 2001. Random Forests. Machine Learning, 45(1), pp.5-32.

  • Cohen, J., Cohen, P., West, S.G. and Aiken, L.S., 2013. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. 3rd ed. New York: Routledge.

  • Dancey, C.P. and Reidy, J., 2011. Statistics Without Maths for Psychology. 5th ed. Harlow: Pearson Education Limited.

  • Everitt, B.S., Landau, S., Leese, M. and Stahl, D., 2011. Cluster Analysis. 5th ed. Chichester: Wiley.

  • Field, A., 2013. Discovering Statistics Using IBM SPSS Statistics. 4th ed. London: SAGE Publications.

  • Hair, J.F., Black, W.C., Babin, B.J. and Anderson, R.E., 2014. Multivariate Data Analysis. 7th ed. Harlow: Pearson Education Limited.

  • Hou, T., Xu, X. and Lee, S., 2009. ADME Evaluation in Drug Discovery. 1. Applications of Genetic Algorithms to the Prediction of Blood-Brain Barrier Penetration. Journal of Chemical Information and Modeling, 49(2), pp.133-144.

  • Kaufman, L. and Rousseeuw, P.J., 2009. Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken: Wiley-Interscience.

  • Kotsiantis, S., Kanellopoulos, D. and Pintelas, P., 2006. Data preprocessing for supervised learning. International Journal of Computer Science, 1(2), pp.111-117.

  • Lipinski, C.A., Lombardo, F., Dominy, B.W. and Feeney, P.J., 2001. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced Drug Delivery Reviews, 46(1-3), pp.3-26.

  • Montgomery, D.C., Peck, E.A. and Vining, G.G., 2012. Introduction to Linear Regression Analysis. 5th ed. Hoboken: Wiley.

  • Quinlan, J.R., 1986. Induction of Decision Trees. Machine Learning, 1(1), pp.81-106.

  • Tabachnick, B.G. and Fidell, L.S., 2013. Using Multivariate Statistics. 6th ed. Boston: Pearson Education.

  • Weinberg, S.L. and Abramowitz, S.K., 2008. Statistics Using SPSS: An Integrative Approach. 2nd ed. Cambridge: Cambridge University Press.

  • Zhu, X., Liu, Q., Yan, Q. and Xu, Y., 2019. Data Preprocessing in Web Usage Mining. International Journal of Computer Science and Information Security (IJCSIS), 17(9), pp.56-60.
